1 Introduction

Real-time depth inference of a given object is an essential computer vision task with applications in various robotic tasks such as simultaneous localization and mapping [1,2,3] as well as autonomous quality inspection in industrial applications [4, 5]. As the popularity of VR applications has continued to grow, instant depth estimation has also become an integral part of modeling complex 3D information from single 2D images of human faces [6, 7] or body parts [8,9,10]. Depth information about an object can be directly obtained from sensors for optical distance measurement. Time-of-Flight (ToF) cameras, LIDAR or stereo imaging systems are often used in practice and were also employed to generate paired RGB-depth data for some well-known depth databases [1, 2, 8, 10,11,12,13]. Since these sensors are typically costly and time-consuming devices that are also sensitive to external influences, their applicability to fast full-image depth generation on small on-site devices is limited. These limitations have motivated depth synthesis from a modality that is much simpler in terms of acquisition effort, namely a conventional RGB image, which has initiated a completely new field of research in computer vision.

An important contribution in that area was made by Eigen et al. [14], who proposed deep convolutional neural networks (DCNNs) for monocular depth synthesis of indoor and outdoor scenes. Basically, monocular single-image depth estimation from RGB images can be seen as a modality transfer in which observed data of one modality is mapped to desired properties of another, potentially more complex, modality. Although DCNNs are promising tools that succeed on such transfer tasks, they commonly rely on large amounts of training data, whose generation and acquisition can be demanding. In the supervised setting in particular, DCNNs make use of paired training data during network parameter optimization, i.e., the network is provided with a single-view RGB image and the corresponding per-pixel depth [6, 10, 14, 15]. Since large-scale dense depth profiles are not abundant in many applications, supervised approaches are not feasible in such cases. One possible way to address these shortcomings of supervised methods is to consider self-supervised approaches based on monocular video clips, in which a supervisory depth counterpart is extracted from pose changes between adjacent frames.

These models can be trained on RGB sequences in a self-supervised manner, where a depth network and a pose estimation network are simultaneously optimized via sophisticated view-synthesis losses [3, 16,17,18]. Obviously, these methods require non-static scenes or a moving camera position (e.g., moving humans [18], autonomous driving [2]).

A very recent example of a scenario in which neither video sequences, stereo pairs nor paired data are available is the non-destructive evaluation of internal combustion engines for stationary power generation [4, 5]. In this application, surface depth information has to be extracted from RGB image data. With current standards, the cylinder condition is assessed from a depth profile of the measured area on a micrometer scale (cf. Fig. 1). However, microscopic depth sensing of cylinder liner surface areas is a time-consuming and resource-intensive task that consists of disassembling the liner, removing it from the engine, cutting it into segments and measuring them with a highly expensive, stationary confocal microscope [4]. With a handheld microscope, however, single RGB records of the liner’s inner surface can be generated from which depth profiles may be synthesized. Since the depth data is generated on a quite small scale (\({1.9\,\hbox {mm}\times 1.9}\,\hbox {mm}\)) and has a comparatively high resolution, it is hardly possible to generate RGB data with accurately aligned pixel positions. As a result, an unsupervised approach is required for reasonable depth synthesis of this static scene.

Fig. 1 Top: RGB measurements of the inner surface of three cylinder liners with a spatial range of \({4.2\,\hbox {mm}\times 4.2}\,\hbox {mm}\), recorded by a handheld microscope. Bottom: Depth profile of the same cylinder with a spatial range of \({1.9\,\hbox {mm}\times 1.9}\,\hbox {mm}\), measured with a confocal microscope. The pixels of the modalities are not aligned

The main objective of this study is to propose a general method for depth estimation of scenes for which neither paired data, video sequences, nor stereo pairs are available. For this, we consider the depth estimation problem as an intermodal transfer task of single images. Several recent advances in unpaired modality transfer are based on generative adversarial networks (GANs) [19], cycle-consistency [20] and probabilistic distance measures [21, 22]. The method proposed in this paper builds on established model architectures and training strategies in deep learning, which are beneficially combined for unpaired single-view depth synthesis. The introduction of a novel perceptual reconstruction term in combination with appropriate hand-crafted filters further improves accuracy and depth contours.

The method is comprehensively tested on the aforementioned industrial application of surface depth estimation. Additionally, the approach is applied to other, external datasets to create realistic scenarios where perfectly aligned RGB-depth data of single images is not available in practice. More precisely, we test the model on the Texas 3D Face Recognition database (Texas-3DFRD) [12], the Bosphorus-3DFA [11] and the CelebAMask-HQ [23] to show its plausibility for facial data in an unsupervised setting.

The SURREAL dataset [9] is used to test performance on RGB-D videos of human bodies, where RGB and depth frames are not perfectly aligned. For every evaluation experiment, the depth accuracy of the proposed framework is compared to state-of-the-art methods in unsupervised single-image transfer. To be more precise, the methods used for comparison are the standard cycleGAN [20], CUT [24], which uses contrastive learning for one-sided transfer, and gcGAN [25], which utilizes geometric constraints between modalities. For facial data, we additionally compare to Wu et al. [26], a very recent work that predicts not only the depth profile but also the albedo image, the illumination source and a symmetry confidence map in an unsupervised manner.

Contributions:

  • This study finds a solution to the industrial problem of single-shot surface depth estimation where no paired data, no video sequences and no stereo pairs are available.

  • In this work, depth estimation is considered as a single-image modality transfer; the proposed method shows superior performance over state-of-the-art works, quantitatively and qualitatively.

  • Application to the completely different tasks of unsupervised face and human body depth synthesis indicates the universality of the approach.

2 Related work

The following section summarizes important milestones in the development of generative adversarial networks and highlights important work on single-image depth estimation as well as depth synthesis via GANs. In the supplementary, background is provided on some 3D databases that have been critical to the development of deep learning-based models for depth estimation.

2.1 Generative adversarial networks

A standard GAN [19] consists of a generator network \(G:\mathcal Z\rightarrow \mathcal X\) mapping from a low-dimensional latent space \(\mathcal Z\) to image space \(\mathcal X\), where parameters of the generator are adapted so that the distribution of generated examples assimilates the distribution of a given data set. To be able to assess any similarity between arbitrary high-dimensional image distributions, a discriminator \(f:\mathcal X\rightarrow [0,1]\) is trained simultaneously to distinguish between generator distribution and real data distribution. In a two-player min-max game, generator parameters are then updated to fool a steadily improving discriminator.
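
Written in the notation above, the original min-max objective of [19] can be stated as follows (a standard formulation, restated here only for reference):

$$\begin{aligned} \min _G\max _f\ \underset{x\sim P_\text {data}}{{\mathbb {E}}}\big [\log f(x)\big ]+ \underset{z\sim P_\mathcal {Z}}{{\mathbb {E}}}\big [\log \big (1-f(G(z))\big )\big ], \end{aligned}$$

where \(P_\text {data}\) denotes the real data distribution on \(\mathcal X\) and \(P_\mathcal {Z}\) a prior on the latent space \(\mathcal Z\).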

Usage of the initially proposed discriminator approach can cause the vanishing gradient problem and does not provide any information on the real distance between the generator distribution and the real distribution. This issue has been discussed thoroughly in [21], where the problem is bypassed by replacing the discriminator with a critic network that approximates the Wasserstein-1 distance [27] between the real distribution and the generator distribution.

While the quintessence of GANs is to draw synthetic instances following a given data distribution, cycle-consistent GANs [20] allow one-to-one mappings between two image domains \(\mathcal X\) and \(\mathcal Y\). In essence, two generator networks \(G_{\mathcal Y}:\mathcal X\rightarrow \mathcal Y,G_{\mathcal X}:\mathcal Y\rightarrow \mathcal X\) and corresponding discriminator networks \(f_{\mathcal Y}:\mathcal Y\rightarrow [0,1],f_\mathcal X:\mathcal X \rightarrow [0,1]\) are trained simultaneously to enable generation of synthetic instances for both image domains (e.g., synthesizing winter landscapes from summer scenes and vice versa). To ensure one-to-one correspondence, a cycle-consistency term is added to the two adversarial loss functionals. While cycle-consistent GANs had initially been constructed for style transfer purposes, they were also very well received in the area of modality transfer in biomedical applications [28,29,30]. Since optimization and fine-tuning of GANs often turns out to be extremely demanding and time-intensive, much research has emphasized stabilization of the training process through the development of stable network architectures such as DCGAN [31] or PatchGAN [32].

2.2 Monocular depth estimation

Deep learning-based methods achieve state-of-the-art results on the depth synthesis task by training a DCNN on a large-scale and extensive data set [1, 2]. Most RGB-based models are supervised, i.e., they require corresponding depth data that is pixel-wise aligned. One of the first DCNN approaches by Eigen et al. [14] included sequential deployment of a coarse-scale stack and a refinement module and was benchmarked on the KITTI [2] and the NYU Depth v2 data sets [1]. Using an encoder–decoder structure in combination with an adversarial loss term helped to increase the visual quality of dense depth estimates [33]. Later methods also considered deep residual networks [34] or deep ordinal regression networks [35] in order to significantly increase performance on these data sets, where the commonly considered performance measures are the root mean squared error (RMSE) and the \(\delta _1\) accuracy [3]. While much research focused on further performance gains at the expense of model complexity and runtime, Wofk et al. [36] used a lightweight network architecture [37] and achieved comparable results.

2.3 Depth estimation using GAN

The use of left-right consistency and a GAN architecture results in excellent unsupervised depth estimation based on stereo images [38, 39]. In [40] and [41], a GAN was trained to perform unpaired depth synthesis from single monocular images. To this end, GANs were employed in the context of domain adaptation using an additional synthesized data set of the same application with paired samples. Such an approach cannot be regarded as fully unsupervised and requires the availability or construction of a synthetic dataset. Arslan and Seke [6] consider a conditional GAN (CGAN) [32] for single-image face depth synthesis. Nevertheless, CGANs rely on paired data since the adversarial part estimates the plausibility of an input–output pair. Another interesting approach was taken in [15], where indoor depth and segmentation were estimated simultaneously using cycle-consistent GANs. The cycle-consistency loss helped to maintain the characteristics of the RGB input during depth synthesis, while the simultaneous segmentation resolved the fading problem in which depth information is hidden by larger features. However, the proposed discriminator network and the reconstruction term in the generator loss function are based on paired RGB and depth/segmentation data, which is not available for the aforementioned industrial application of surface depth synthesis.

3 Method

This section proposes an approach to monocular single-image depth synthesis with unpaired data and discusses the introduced framework and training strategy in detail.

Fig. 2 Illustration of the proposed framework: The left part describes the domains in which the RGB-to-depth generator \(G_{\theta _\mathcal {Y}}\) and the contrary depth-to-RGB generator \(G_{\theta _\mathcal {X}}\) operate. Both generators are updated via the probabilistic Wasserstein-1 distance, estimated by \(f_{\omega _\mathcal {Y}}\) in the input and \(f_{\omega _\mathcal {X}}\) in the target domain. Perceptual similarity is compared between each generator input and its reconstruction. The right plot indicates that during inference, only \(G_{\theta _\mathcal {Y}}\) has to be deployed to synthesize new depth profiles. RGB images and ground truth depth images were taken from the Texas-3DFRD [12]

3.1 Setting and GAN architecture

The underlying structure of the proposed modality synthesis is two GANs linked with a reconstruction term (cf. Fig. 2). To be more exact, let \(\mathcal X \subset [0,255]^{d_1\times d_2\times 3}\) and \(\mathcal Y \subset {\mathbb {R}}^{d_1\times d_2\times 1}\) denote the domain of RGB and depth images, respectively, where the number of image pixels \(d_1\cdot d_2\) is the same in both domains. Furthermore, let \(X{:}{=}\{x_1,\ldots ,x_M\}\) be the set of M given RGB images and \(Y{:}{=}\{y_1,\ldots ,y_N\}\) the set of N available but unaligned depth profiles. \(P_\mathcal {X}\) and \(P_\mathcal {Y}\) denote the distributions of the images in both domains. The proposed model includes a generator function \(G_{{\theta _\mathcal {Y}}}:\mathcal X\rightarrow \mathcal Y\), which aims to map an input RGB image to a corresponding depth counterpart in the target domain. A generator function for image transfer may be approximated by a DCNN, which is parameterized by a weight vector \({\theta _\mathcal {Y}}\) consisting of several convolution kernels. By adjusting \({\theta _\mathcal {Y}}\), the distribution of generator outputs \(P_{{\theta _\mathcal {Y}}}\) may be brought closer to the real data distribution in the depth domain \(P_\mathcal {Y}\). Note that we do not know what \(P_{{\theta _\mathcal {Y}}}\) and \(P_\mathcal {Y}\) actually look like; we only have access to unpaired training samples \(G_{{\theta _\mathcal {Y}}}(x)\sim P_{{\theta _\mathcal {Y}}},\ x\in X\) and \(y\sim P_\mathcal {Y},\ y\in Y\). An adversarial approach is deployed to ensure assimilation of both high-dimensional distributions in the GAN setting. The distance between the generator distribution and the real distribution is estimated by an additional DCNN \(f_{\omega _\mathcal {Y}}:\mathcal Y \rightarrow {\mathbb {R}}\), which is parameterized by a weight vector \({\omega _\mathcal {Y}}\) and is trained simultaneously with the generator network since \(P_{\theta _\mathcal {Y}}\) changes after each update to the generator weights \({\theta _\mathcal {Y}}\). This ensures that \(G_{\theta _\mathcal {Y}}\) can be pitted against a steadily improving loss network \(f_{\omega _\mathcal {Y}}\) [19].

In this work, the critic network is based on the Wasserstein-1 distance [21, 27]. The Wasserstein-1 distance (earth mover's distance) between two distributions \(P_1\) and \(P_2\) is defined as \( {\mathcal {W}}_1(P_1, P_2) {:}{=}\inf _{J\in {\mathcal {J}}(P_1,P_2)}{\mathbb {E}}_{(x,y)\sim J}\left\Vert x-y\right\Vert \), where the infimum is taken over the set of all joint probability distributions that have marginal distributions \(P_1\) and \(P_2\). Since the exact computation of the infimum is highly intractable, the Kantorovich–Rubinstein duality [27] is used

$$\begin{aligned} {\mathcal {W}}_1(P_1,P_2) =\sup _{\left\Vert f\right\Vert _L\le 1}\left[ \underset{y\sim P_1}{{\mathbb {E}}}f(y)- \underset{y\sim P_{2}}{{\mathbb {E}}}f(y)\right] , \end{aligned}$$
(1)

where \(\left\Vert \cdot \right\Vert _L\le C\) denotes that a function is C-Lipschitz. Equation (1) indicates that a good approximation to \({\mathcal {W}}_1(P_\mathcal {Y},P_{\theta _\mathcal {Y}})\) is found by maximizing the distance \({{\mathbb {E}}}_{y\sim P_\mathcal {Y}}f_{\omega _\mathcal {Y}}(y)- {{\mathbb {E}}}_{y\sim P_{\theta _\mathcal {Y}}}f_{\omega _\mathcal {Y}}(y)\) over the set of DCNN weights \(\{{\omega _\mathcal {Y}}\mid f_{\omega _\mathcal {Y}}:\mathcal {Y}\rightarrow {\mathbb {R}}\ \text {1-Lipschitz}\}\), where the Lipschitz continuity of \(f_{\omega _\mathcal {Y}}\) can be encouraged via a gradient penalty [22]. Given training batches \({\textbf{y}}=\{y_n\}_{n=1}^b,\ y_n \overset{\textrm{iid}}{\sim } P_\mathcal {Y}\) and \({\textbf{x}}=\{x_n\}_{n=1}^b,\ x_n\overset{\textrm{iid}}{\sim } P_\mathcal {X}\), this yields the following empirical risk for critic \(f_{\omega _\mathcal {Y}}\):

$$\begin{aligned} \begin{aligned} \mathcal R_\text {cri}({\omega _\mathcal {Y}},{\theta _\mathcal {Y}},p,{\textbf{y}},{\textbf{x}})&\,{:}{=}\, \frac{1}{b}\sum _{n=1}^{b}\bigg [ f_{\omega _\mathcal {Y}}(G_{\theta _\mathcal {Y}}(x_n))-f_{\omega _\mathcal {Y}}(y_n)\\&\quad +p\cdot \left( \Big (\left\Vert \nabla _{{\tilde{y}}_n}f_{\omega _\mathcal {Y}}(\tilde{y}_n)\right\Vert _2-1\Big )_+ \right) ^2\bigg ], \end{aligned} \end{aligned}$$
(2)

where p denotes the influence of the gradient penalty, \(( \cdot ) _+{:}{=}\max (\{0,\cdot \})\) and \({\tilde{y}}_n {:}{=}\epsilon _n\cdot G_{\theta _\mathcal {Y}}(x_n)+ (1-\epsilon _n)\cdot y_n\) for \(\epsilon _n\overset{\textrm{iid}}{\sim } \mathcal U[0,1]\). The goal of the RGB-to-depth generator \(G_{\theta _\mathcal {Y}}\) is to minimize the distance. Since only the first term of the functional in (2) depends on the generator weights \({\theta _\mathcal {Y}}\), the adversarial empirical risk for generator \(G_{\theta _\mathcal {Y}}\) simplifies as follows:

$$\begin{aligned} \mathcal R_\text {adv}({\theta _\mathcal {Y}},{\omega _\mathcal {Y}},{\textbf{x}}){:}{=}-\frac{1}{b}\sum _{n=1}^{b}f_{\omega _\mathcal {Y}}(G_{\theta _\mathcal {Y}}(x_n)). \end{aligned}$$
(3)
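
The following minimal TensorFlow sketch evaluates the critic risk (2) and the adversarial generator risk (3) for the RGB-to-depth direction; the depth-to-RGB direction is analogous. All function and variable names (e.g., rgb_to_depth, depth_critic) are illustrative placeholders and not taken from the authors' implementation.

```python
# Minimal TensorFlow sketch of the critic risk (2) and the adversarial generator
# risk (3) for the RGB-to-depth direction; names are illustrative placeholders.
import tensorflow as tf

def critic_risk(depth_critic, rgb_to_depth, x_batch, y_batch, p=100.0):
    """Empirical Wasserstein-1 critic risk with a one-sided gradient penalty."""
    y_fake = rgb_to_depth(x_batch, training=True)              # G_theta_Y(x_n)
    w_term = tf.reduce_mean(depth_critic(y_fake, training=True)
                            - depth_critic(y_batch, training=True))
    # random interpolates between generated and real depth profiles
    eps = tf.random.uniform([tf.shape(x_batch)[0], 1, 1, 1], 0.0, 1.0)
    y_tilde = eps * y_fake + (1.0 - eps) * y_batch
    with tf.GradientTape() as tape:
        tape.watch(y_tilde)
        crit_out = depth_critic(y_tilde, training=True)
    grads = tape.gradient(crit_out, y_tilde)
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    penalty = tf.reduce_mean(tf.square(tf.nn.relu(grad_norm - 1.0)))  # ((.)_+)^2
    return w_term + p * penalty

def generator_adversarial_risk(depth_critic, rgb_to_depth, x_batch):
    """Eq. (3): only the first term of (2) depends on the generator weights."""
    return -tf.reduce_mean(depth_critic(rgb_to_depth(x_batch, training=True),
                                        training=True))
```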

3.2 Perceptual reconstruction

In the context of depth synthesis, it is not sufficient to ensure that the output samples lie in the depth domain. Care must be taken that synthetic depth profiles do not become irrelevant to the input. A reconstruction constraint forces generator input and output to share the same spatial structure by taking into account the similarity between the input and the reconstruction of the synthesized depth profile. Obviously, calculation of a reconstruction error requires a contrary generator function \(G_{\theta _\mathcal {X}}:\mathcal Y\rightarrow \mathcal X\) that assimilates the real RGB distribution \(P_\mathcal {X}\), as well as a corresponding distance network \(f_{\omega _\mathcal {X}}:\mathcal X \rightarrow {\mathbb {R}}\). Both have to be optimized simultaneously with the RGB-to-depth direction. The reconstruction error is commonly evaluated by assessing similarity between x and \(G_{\theta _\mathcal {X}}(G_{\theta _\mathcal {Y}}(x))\) as well as similarity between y and \(G_{\theta _\mathcal {Y}}(G_{\theta _\mathcal {X}}(y))\) for \(x\in \mathcal X\) and \(y\in \mathcal Y\). In the setting of style transfer and cycle-consistent GANs [20], a pixel-wise distance function on image space is considered, where the mean absolute error (MAE) or the mean squared error (MSE) is the common choice.
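
Written in the notation of Sect. 3.1 and with the MAE as pixel-wise distance, this standard cycle-consistency term [20] reads

$$\begin{aligned} \mathcal L_\text {cyc}({\theta _\mathcal {X}},{\theta _\mathcal {Y}}){:}{=}\underset{x\sim P_\mathcal {X}}{{\mathbb {E}}}\left\Vert G_{\theta _\mathcal {X}}\big (G_{\theta _\mathcal {Y}}(x)\big )-x\right\Vert _1+\underset{y\sim P_\mathcal {Y}}{{\mathbb {E}}}\left\Vert G_{\theta _\mathcal {Y}}\big (G_{\theta _\mathcal {X}}(y)\big )-y\right\Vert _1, \end{aligned}$$

where the norms are averaged over pixels; this is exactly the pixel-space formulation that the perceptual reconstruction term introduced below replaces.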

The use of a contrary generator \(G_{\theta _\mathcal {X}}\) can be viewed as a type of regularization since it prevents mode collapse, i.e., generator outputs remain dependent on the inputs. Deployment of the cycle-consistency approach [20], where the reconstruction error is measured in image space, assumes that no information is lost during the modality transition. This assumption holds for applications such as summer-to-winter landscape or photograph-to-Monet painting transition. Determining \(G_{\theta _\mathcal {Y}}\) and \(G_{\theta _\mathcal {X}}\) is an ill-posed problem since a single depth profile may be generated by an infinite number of distinct RGB images and vice versa [42]. For example, during the RGB-to-depth transition of human faces, information on image brightness, light source or the subject’s skin color is lost. As a consequence, the contrary depth-to-RGB generator needed for regularization has to synthesize these lost properties of the image. Both generators \(G_{\theta _\mathcal {Y}}\) and \(G_{\theta _\mathcal {X}}\) may then be penalized if the skin color or the brightness of the reconstruction is changed, even though \(G_{\theta _\mathcal {X}}\) did exactly what we expected it to do, i.e., synthesize a face that is consistent with the input’s depth profile.

Adapting the idea of [43], we propose a perceptual reconstruction loss, i.e., instead of computing a reconstruction error in image space, we consider certain image features of the reconstruction. Typical perceptual similarity metrics extract features by propagating the images (to be compared) through an auxiliary network that is usually pretrained on a large image classification task [43,44,45]. However, we want our feature extractor to be tailored to our data rather than determined by an additional network pretrained on a very general classification task [44] that may not even cover our type of data. Therefore, we enforce reconstruction consistency in the RGB domain by using the MAE loss on feature vectors extracted by \(\phi _\mathcal {X}(\cdot ){:}{=}f_{\omega _\mathcal {X}}^l(\cdot )\), which corresponds to the l-th layer of the RGB critic (cf. Algorithm 1). Analogously, we define the feature extractor on the depth domain by \(\phi _\mathcal {Y}(\cdot ){:}{=}f_{\omega _\mathcal {Y}}^l(\cdot )\), which corresponds to the l-th layer of the depth critic. Although we are aware that the feature extractor weights are adjusted with each update of the critic weights \({\omega _\mathcal {X}},{\omega _\mathcal {Y}}\), we assume that, at least at a later stage of training, \(\phi _\mathcal {X}\) and \(\phi _\mathcal {Y}\) have learned good and stable features on the image and depth domain. This yields the following empirical reconstruction risk:

$$\begin{aligned} \begin{aligned}&\mathcal R_\text {rec} ({\theta _\mathcal {X}},{\theta _\mathcal {Y}},\phi _\mathcal {X},\phi _\mathcal {Y},{\textbf{x}},{\textbf{y}})\,{:}{=}\, \frac{1}{b}\sum _{n=1}^b\text {MAE}\\&\qquad \big [\phi _\mathcal {X}\big (G_{\theta _\mathcal {X}}\left( G_{\theta _\mathcal {Y}}(x_n)\right) \big ),\phi _\mathcal {X}(x_n) \big ]\\&\quad + \frac{1}{b}\sum _{n=1}^b\text {MAE}\big [\phi _\mathcal {Y}\big (G_{\theta _\mathcal {Y}}\left( G_{\theta _\mathcal {X}}(y_n)\right) \big ),\phi _\mathcal {Y}(y_n) \big ]. \end{aligned} \end{aligned}$$
(4)

In our implementation, we set \(l{:}{=}L-2\) for a critic with L layers, i.e., we use the second-to-last layer of the critic.
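
A minimal sketch of how the feature extractors \(\phi _\mathcal {X},\phi _\mathcal {Y}\) and the perceptual risk (4) could be realized is given below, assuming the critics are Keras models built with the functional API so that an intermediate layer can be exposed; all names are illustrative and not taken from the authors' code.

```python
# Sketch of the feature extractors phi and the perceptual risk (4); the critics
# are assumed to be Keras functional models, and all names are illustrative.
import tensorflow as tf

def make_feature_extractor(critic: tf.keras.Model) -> tf.keras.Model:
    """phi(.) := activations of the second-to-last critic layer (l = L - 2)."""
    return tf.keras.Model(inputs=critic.inputs, outputs=critic.layers[-2].output)

def perceptual_reconstruction_risk(phi_x, phi_y, G_Y, G_X, x_batch, y_batch):
    """Eq. (4): MAE between critic features of the inputs and their reconstructions."""
    x_rec = G_X(G_Y(x_batch, training=True), training=True)    # RGB -> depth -> RGB
    y_rec = G_Y(G_X(y_batch, training=True), training=True)    # depth -> RGB -> depth
    loss_x = tf.reduce_mean(tf.abs(phi_x(x_rec) - phi_x(x_batch)))
    loss_y = tf.reduce_mean(tf.abs(phi_y(y_rec) - phi_y(y_batch)))
    return loss_x + loss_y
```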

A good reconstruction term must still be found for the start of training when the critic features are not yet sufficiently reliable. At first, it is desirable to guide the framework to preserve structural similarity during RGB-to-depth and depth-to-RGB transition. Therefore, we propose to compare the input and its reconstruction in the image space while automatically removing the brightness, illumination and color of the RGB images beforehand. This can be ensured by applying the following steps:

  1. Convert the image to grayscale by applying the function \(g:[0,255]^{d_1\times d_2\times 3}\rightarrow {\mathbb {R}}^{d_1\times d_2},\quad x\mapsto \frac{0.299}{255}\cdot x_{(,,0)} +\frac{0.587}{255}\cdot x_{(,,1)}+\frac{0.114}{255}\cdot x_{(,,2)}\), where (, , i) denotes the i-th color channel for \(i=0,1,2\).

  2. Enhance the brightness of the grayscale image using an automated gamma correction based on the image brightness [46], i.e., take the grayscale image \(x_\text {gr}\) to the power of \(\Gamma (x_\text {gr}){:}{=}{-0.3\cdot 2.303}/{\ln \overline{x_\text {gr}}}\), where \( \overline{x_\text {gr}}\) denotes the average of the gray values.

  3. Convolve the enhanced image with a high-pass filter h in order to dim the lighting source and color information (cf. Fig. 3). The high-pass filter may be applied in the Fourier domain, i.e., the 2D Fourier transform is multiplied by a Gaussian high-pass filter matrix \(H^\sigma \) defined by \(H^\sigma _{i,j}{:}{=}1-\exp \big (-\left\Vert (i,j)-(\frac{d_1}{2},\frac{d_2}{2})\right\Vert _2^2 / (2\sigma ^2)\big )\) for \(i=1,\ldots ,d_1\) and \( j=1,\ldots ,d_2\). In our implementation, \(\sigma =4\) yielded satisfactory results for all tasks. A minimal code sketch of these three steps follows this list.
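
The sketch below implements the three filtering steps, i.e., the operator \(\psi \) used in (5), in NumPy; the array conventions and the FFT-based filtering are illustrative assumptions rather than the authors' implementation.

```python
# Minimal NumPy sketch of steps 1-3 above (the operator psi); illustrative only.
import numpy as np

def to_grayscale(x_rgb):
    """Step 1: weighted channel average, rescaled from [0, 255] to roughly [0, 1]."""
    w = np.array([0.299, 0.587, 0.114]) / 255.0
    return x_rgb @ w                                  # (d1, d2, 3) -> (d1, d2)

def gamma_correct(x_gr):
    """Step 2: automated gamma correction based on the mean image brightness."""
    gamma = -0.3 * 2.303 / np.log(np.mean(x_gr) + 1e-8)
    return np.power(x_gr, gamma)

def gaussian_highpass(x, sigma=4.0):
    """Step 3: multiply the centered 2D spectrum with a Gaussian high-pass mask."""
    d1, d2 = x.shape
    ii, jj = np.meshgrid(np.arange(d1), np.arange(d2), indexing="ij")
    dist2 = (ii - d1 / 2.0) ** 2 + (jj - d2 / 2.0) ** 2
    H = 1.0 - np.exp(-dist2 / (2.0 * sigma ** 2))
    spectrum = np.fft.fftshift(np.fft.fft2(x))
    return np.real(np.fft.ifft2(np.fft.ifftshift(H * spectrum)))

def psi(x_rgb, sigma=4.0):
    x_gr = to_grayscale(x_rgb)
    return gaussian_highpass(gamma_correct(x_gr), sigma)
```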

This yields the updated empirical reconstruction risk:

$$\begin{aligned} \begin{aligned}&\mathcal R_\text {rec} ({\theta _\mathcal {X}},{\theta _\mathcal {Y}},\phi _\mathcal {X},\phi _\mathcal {Y},\gamma ,{\textbf{x}},{\textbf{y}})\\&\quad {:}{=}\, \gamma \cdot \frac{1}{b}\sum _{n=1}^b\text {MAE}\big [\phi _\mathcal {X}\big (G_{\theta _\mathcal {X}}\left( G_{\theta _\mathcal {Y}}(x_n)\right) \big ),\phi _\mathcal {X}(x_n) \big ]\\&\qquad + \gamma \cdot \frac{1}{b}\sum _{n=1}^b\text {MAE}\big [\phi _\mathcal {Y}\big (G_{\theta _\mathcal {Y}}\left( G_{\theta _\mathcal {X}}(y_n)\right) \big ),\phi _\mathcal {Y}(y_n) \big ]\\&\qquad + (1-\gamma )\cdot \frac{1}{b}\sum _{n=1}^b\text {MAE}\big [\psi \big (G_{\theta _\mathcal {X}}\left( G_{\theta _\mathcal {Y}}(x_n)\right) \big ),\psi (x_n) \big ]\\&\qquad + (1-\gamma )\cdot \frac{1}{b}\sum _{n=1}^b\text {MAE}\big [G_{\theta _\mathcal {Y}}\left( G_{\theta _\mathcal {X}}(y_n)\right) ,y_n \big ], \end{aligned} \end{aligned}$$
(5)

where \(\psi (\cdot ){:}{=}h*g(\cdot )^{\Gamma (g(\cdot ))}\) and \(\gamma \) is gradually increased from 0 to 1 during training as the critic features become increasingly reliable. In the far-right column of Fig. 3, we can observe the strong effect of the operator \(\psi \). For the face sample, the face shape and the positions of the nose and the eyes are clearly visible, while at the same time the low image brightness and the exposure direction are removed. The main edges of the cylinder liner surfaces are clearly identifiable, whereas the different brown levels and illumination inconsistencies of the input are no longer visible.
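
Putting the pieces together, the following sketch evaluates the blended reconstruction risk (5). The linear ramp for \(\gamma \) is an assumption (the text only states that \(\gamma \) grows from 0 to 1 during training), and psi_tf stands for a TensorFlow-compatible version of the operator \(\psi \) sketched above.

```python
# Sketch of the blended reconstruction risk (5); the linear gamma ramp and the
# TensorFlow version of psi (psi_tf) are assumptions, names are illustrative.
import tensorflow as tf

def gamma_schedule(step, total_steps):
    # assumed linear ramp from 0 to 1 over the course of training
    return tf.minimum(1.0, tf.cast(step, tf.float32) / float(total_steps))

def reconstruction_risk(phi_x, phi_y, psi_tf, G_Y, G_X, x, y, gamma):
    """Eq. (5): gamma blends perceptual (critic-feature) and filtered pixel-wise MAE."""
    mae = lambda a, b: tf.reduce_mean(tf.abs(a - b))
    x_rec = G_X(G_Y(x, training=True), training=True)    # RGB -> depth -> RGB
    y_rec = G_Y(G_X(y, training=True), training=True)    # depth -> RGB -> depth
    perceptual = mae(phi_x(x_rec), phi_x(x)) + mae(phi_y(y_rec), phi_y(y))
    pixelwise = mae(psi_tf(x_rec), psi_tf(x)) + mae(y_rec, y)
    return gamma * perceptual + (1.0 - gamma) * pixelwise
```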

Fig. 3 The first column visualizes the RGB samples and the second column the grayscale versions. The third column contains the gamma corrected counterparts, where the contrast in lower gray levels is enhanced for dark images in particular. The last column illustrates the application of the high-pass filter

Using the previously discussed risk functions \(\mathcal R_\text {cri}\) (2), \(\mathcal R_\text {adv}\) (3) and \(\mathcal R_\text {rec}\) (5), Algorithm 1 summarizes the proposed architecture for fully unsupervised single-view depth estimation. An implementation of the proposed framework is publicly available at https://github.com/anger-man/unsupervised-depth-estimation.

Algorithm 1 Proposed Framework

3.3 Network implementation

As critical as the loss function design of an unsupervised method is the choice of an appropriate architecture for the critic and the generator network. The critic is built following the PatchGAN critic initially proposed in [32], with nearly \(15.7\times 10^{6}\) parameters. The PatchGAN architecture has been found to perform quite stably over a variety of different generative tasks and is part of many state-of-the-art architectures for image generation [20, 24, 47]. The generator is a ResNet18 [48] with a depth-specific upsampling part taken from [17] (\(19.8\times 10^{6}\) parameters). Detailed information on the critic and generator implementations is provided in the supplementary.

4 Experiments and discussion

The framework proposed in Algorithm 1 is implemented with the publicly available TensorFlow framework [49]. The applications are inner surface depth estimation of cylinder liners, face depth estimation based on the Texas-3DFRD [12] and body depth synthesis using the SURREAL dataset [9]. In this section, we benchmark the proposed framework on each dataset and separately present the results, followed by a discussion at the end. As discussed in the introduction, the methods used for comparison are cycleGAN [20], gcGAN [47] and CUT [24]. For CUT, we use the public GitHub repository (see footnote 1). The benchmark methods cycleGAN and gcGAN use the same critic and generator implementations as the method proposed in this study (cf. Sect. 3.3). For cycleGAN, we remove the novel perceptual loss and hand-crafted image filters from our method and replace them with an MAE reconstruction loss. For gcGAN, the contrary generator is removed and up-down flip is employed as the geometric constraint.

An ablation study is conducted in order to highlight the impact and necessity of the novel hand-crafted filters and the perceptual reconstruction loss proposed here. More precisely, we set \(\psi \triangleq \text {Id}\) in (5) in order to disable the hand-crafted filters, i.e., the reconstruction loss in the RGB domain is determined based on the MAE. Furthermore, we set \(\gamma \triangleq 0\) in (5) for the entire training process to study network behavior without perceptual reconstruction in both domains. The experiments without hand-crafted filters (w/o \(\psi \)) and without perceptual reconstruction (w/o \(\phi \)) are performed for the test cases of surface depth and face depth estimation (cf. Tables 1 and 2).

Table 1 Unsup. surface depth estimation: the reported metrics are RMSE and MAE of the ground truth and the synthesized depth and are evaluated on unseen data (smaller is better)
Table 2 Unsup. face depth estimation: the reported metrics are RMSE and MAE of the ground truth and the synthesized depth and are evaluated on unseen data (smaller is better)

In our implementation, we set the number of generator updates \(n_G\) to 10k, the minibatch size b to 8 and the penalty weight p to 100. The number of critic iterations \(n_f\) is initially set to 24 to ensure a good approximation of the Wasserstein-1 distance at the beginning of training. After 1000 generator updates, it is halved to speed up training. Furthermore, we set \(\alpha _f\) to \({5 \times 10^{-5}}\) and \(\alpha _G\) to \(1\times 10^{-4}\). The weight of the reconstruction term, \(\lambda _\text {rec}\), is determined for each dataset and method individually by a parameter grid search.
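
Sketched as a schematic training loop, the schedule reads as follows; the hyperparameter values mirror the text, while the choice of Adam optimizers is an assumption not stated in this section.

```python
# Schematic training schedule; values follow the text, optimizer choice is assumed.
import tensorflow as tf

N_GENERATOR_UPDATES = 10_000     # n_G
BATCH_SIZE = 8                   # b
GP_WEIGHT = 100.0                # p
ALPHA_F, ALPHA_G = 5e-5, 1e-4    # critic / generator learning rates

critic_opt = tf.keras.optimizers.Adam(ALPHA_F)       # optimizer choice is assumed
generator_opt = tf.keras.optimizers.Adam(ALPHA_G)    # optimizer choice is assumed

for g_step in range(N_GENERATOR_UPDATES):
    # 24 critic iterations initially, halved after 1000 generator updates
    n_f = 24 if g_step < 1000 else 12
    for _ in range(n_f):
        pass  # sample minibatches and minimize R_cri (Eq. 2) for both critics
    pass      # one update of both generators on R_adv (Eq. 3) + lambda_rec * R_rec (Eq. 5)
```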

4.1 Surface depth

This study uses the same database initially proposed in [4] for depth estimation of inner cylinder liner surfaces of large internal combustion engines.

Depth measurements cover a spatial region of 1.9 mm \(\times \) 1.9 mm, have a dimension of approximately 4000\(\times \)4000 pixels and are acquired using a resource-intensive logistic chain as discussed in the introduction. The profiles denote relative depth with respect to the core area of the surface on a \(\upmu \textrm{m}\) scale. The RGB data is taken from the same cylinder surfaces with a simple handheld microscope. The RGB measurements cover a region of 4.2 mm \(\times \) 4.2 mm and have a resolution of nearly 1024\(\times \)1024 pixels. The smaller image area of the depth measurements is not registered in the larger RGB area, but the RGB instances are randomly cropped to 1.9 mm \(\times \) 1.9 mm to ensure the same spatial size between RGB and depth data. In total, 592 random samples are obtained from each image domain. The RGB and depth data is then augmented separately to nearly 7000 samples via random cropping, flipping and gamma correction [46]. To make computation feasible on an NVIDIA GeForce RTX 2080 GPU, each sample is resized to a dimension of 256\(\times \)256 pixels. In order to assess the visual quality between two completely unaligned domains, we also generated depth profiles of 211 additional surface areas and registered them with great effort using shear transformations and a mutual information criterion. These evaluation samples are not included in the training database. During optimization, RGB images and depth profiles are scaled from [0, 255] to \([-1,1]\) and from \([-5,5]\) to \([-1,1]\), respectively, whereas evaluation metrics (RMSE and MAE) are calculated on the original depth scale in \(\upmu \textrm{m}\).
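
For clarity, the scaling and evaluation conventions just described can be sketched as follows; the helper names are illustrative.

```python
# Sketch of the scaling and evaluation conventions for the surface depth task.
import numpy as np

def scale_rgb(x):
    return x / 127.5 - 1.0                 # [0, 255] -> [-1, 1]

def scale_surface_depth(y_um):
    return y_um / 5.0                      # [-5, 5] micrometres -> [-1, 1]

def evaluate_surface_depth(pred_scaled, gt_um):
    """RMSE and MAE are computed on the original micrometre scale."""
    pred_um = pred_scaled * 5.0            # back to micrometres
    rmse = np.sqrt(np.mean((pred_um - gt_um) ** 2))
    mae = np.mean(np.abs(pred_um - gt_um))
    return rmse, mae
```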

Table 3 Unsup. body depth estimation: the reported metrics are RMSE and MAE of the ground truth and the synthesized depth and are evaluated on unseen data (smaller is better)
Fig. 4 From left to right: surface RGB input, ground truth and profiles predicted by our method, gcGAN and cycleGAN

4.2 Face depth

The Texas-3DFRD [12] consists of 118 individuals, and a variety of facial expressions with corresponding depth profiles is available for each of them. Depth pixels represent absolute depth and take values in [0, 1], where 1 represents the near clipping plane and 0 denotes the background. We randomly select 16 individuals as evaluation data and use the remaining samples as training data. For unsupervised training, we randomly select 50% of the training individuals for the input domain and use the depth images of the remaining 50% for the target domain. We resize all RGB frames and depth profiles to a dimension of \(256\times 256\) pixels. Data is augmented via flipping, histogram equalization and Gaussian blur to nearly 6300 samples per modality. During optimization, RGB images are scaled from [0, 255] to \([-1,1]\) and depth profiles are scaled from [0, 1] to \([-1,1]\), whereas the evaluation metrics RMSE and MAE are computed on the original depth scale.

More experiments on unsupervised facial depth synthesis on the Bosphorus-3DFA [11], the CelebAMask-HQ [23] and qualitative comparison to Wu et al. [26] are presented in the supplementary.

Fig. 5 An instant 3D model generated by our proposed framework provides valuable information on the liner surface condition

4.3 Body depth

The SURREAL dataset [9] consists of nearly 68k video clips that show 145 different synthetic subjects performing various actions. The clips consist of 100 RGB frames with perfectly aligned depth profiles that denote real-world camera distance. We use the same train/test split as Varol et al. [9], i.e., we remove nearly 12.5k clips and use the middle frame of each 100-frame clip for evaluation. From the remaining clips, 2500 are randomly selected for training. We choose 20 RGB and 20 depth frames per clip, ensuring that the RGB and depth frames are disjoint in order to mimic an application without any accurately aligned RGB-depth pairs. This results in approximately 50k samples per modality. We strictly follow the preprocessing pipeline of Varol et al. [9], cropping each frame to the human bounding box and resizing/padding images to a dimension of \(256 \times 256\) pixels.

In addition, for each image we subtract the median of the depth values to fit the depth images into the range \(\pm 0.4725\) m, where values less than or equal to \(-0.4725\) denote background. During optimization, RGB images are scaled from [0, 255] to \([-1,1]\) and depth profiles are scaled from \([-0.4725,0.4725]\) to \([-1,1]\), whereas evaluation metrics RMSE and MAE are computed on the original depth scale in meters.
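
A small sketch of this depth normalization is given below; whether out-of-range values are clipped is not stated in the text, so the clipping is an assumption.

```python
# Sketch of the SURREAL depth normalization; the clipping step is an assumption.
import numpy as np

DEPTH_RANGE = 0.4725  # metres; values <= -0.4725 denote background

def normalize_body_depth(depth_m):
    centred = depth_m - np.median(depth_m)                 # subtract per-image median
    centred = np.clip(centred, -DEPTH_RANGE, DEPTH_RANGE)  # assumed clipping
    return centred / DEPTH_RANGE                           # -> [-1, 1]

def denormalize_body_depth(depth_scaled):
    return depth_scaled * DEPTH_RANGE                      # back to metres for RMSE/MAE
```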

4.4 Discussion

Quantitative evaluation on unseen test data in Tables 1, 2 and 3 confirms the superiority of the proposed method over other state-of-the-art modality transfer methods. In particular, the CUT method is not suitable for the depth estimation of planar surfaces and human bodies. Evidently, the use of the novel perceptual reconstruction term in combination with hand-crafted image filters overcomes the shortcomings of a standard cycle-consistency constraint, as explained in Sect. 3.2, and improves depth accuracy significantly. Considering the industrial application, Figs. 4 and 5 indicate that we have been able to synthesize realistic surface depth profiles with an RMSE of 0.751 \(\upmu \textrm{m}\) compared to the registered ground truth. In Fig. 6, we observe that the predictions of our method are most similar to the ground truth, while the results of cycleGAN and CUT do not correctly reproduce the contours of the input. The plausibility of our depth predictions is also confirmed by the instant 3D model in Fig. 7. In Fig. 8, it can be seen that the CUT benchmark completely fails on the SURREAL dataset, which can possibly be attributed to the fact that, in parallel to the depth estimation, the body must also be segmented.

Fig. 6 From left to right: Face RGB input, ground truth and profiles predicted by our method, gcGAN, cycleGAN and CUT

Fig. 7 An example of viewpoint augmentation using a 3D face model instantly generated by our proposed framework

Fig. 8 From left to right: Body RGB input, ground truth and profiles predicted by the proposed method, gcGAN, cycleGAN and CUT

Although the proposed method was initially motivated by cycleGAN [20], it is important to point out that replacing the standard cycle-consistency term with perceptual losses and using appropriate hand-crafted filters in image space is a novel idea that overcomes significant shortcomings of the standard cycleGAN architecture in depth estimation, which are thoroughly discussed in this paper. For depth synthesis of surfaces, faces and human bodies, the RMSE decreases by about 9.8%, 39.1% and 12.1%, respectively, compared to a standard cycleGAN. Tables 1 and 2 show how the removal of the perceptual reconstruction loss (w/o \(\phi \)) and the hand-crafted filters (w/o \(\psi \)) reduces the accuracy of the proposed method. However, the use of perceptual reconstruction and the inclusion of hand-crafted filters each outperform the cycleGAN benchmark on their own, with the combination of the two techniques providing the best performance in terms of the evaluation metrics. The proposed method was mainly developed to solve the problem of depth synthesis of planar cylinder liner surfaces. The results confirm that the framework not only succeeds on the cylinder surface task but also significantly improves performance in the field of face and whole-body depth synthesis compared to state-of-the-art modality transfer methods.

All three prototypical studies of single-shot depth prediction have in common that the color of the objects in the RGB instances has nearly no effect on the depth. This was the main motivation for the hand-crafted filters that convert RGB instances to gray values and remove low-frequency components. However, the motivation for these filters does not apply to all depth estimation problems. Indeed, there are examples where the color of the RGB instance could also give an indication of the depth of the observed scene. An example would be depth estimation from satellite images, i.e., modeling altitude from aerial imagery data. In such cases, the structure of the hand-crafted filters must be reconsidered and adjusted accordingly.

5 Conclusion

This paper proposes a framework for fully unsupervised single-shot depth estimation from monocular RGB images based on the Wasserstein-1 distance, a novel perceptual reconstruction loss and hand-crafted image filters. The model is comprehensively evaluated on different depth synthesis tasks without using paired RGB and depth data during training. The approach provides a reasonable solution for estimating the relative depth of cylinder liner surfaces when generation of paired data is technically not feasible. Moreover, the proposed algorithm also shows promising results when applied to the task of absolute depth estimation of human bodies and faces, indicating that it can be generalized to other real-life tasks.

However, one disadvantage of the perceptual reconstruction approach is that four neural networks must be fitted in parallel.

Future work will therefore include the development of one-sided depth synthesis models in an unsupervised manner as well as the application of our approach to other modality transfer tasks.